Effective Listings of Function Stop words for Twitter
نویسنده
چکیده
Many words in documents recur very frequently but are essentially meaningless as they are used to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents. Due to their high frequency of occurrence, their presence in text mining presents an obstacle to the understanding of the content in the documents. To eliminate the bias effects, most text mining software or approaches make use of stop words list to identify and remove those words. However, the development of such top words list is difficult and inconsistent between textual sources. This problem is further aggravated by sources such as Twitter which are highly repetitive or similar in nature. In this paper, we will be examining the original work using term frequency, inverse document frequency and term adjacency for developing a stop words list for the Twitter data source. We propose a new technique using combinatorial values as an alternative measure to effectively list out stop words. KeywordsStop words; Text mining; RAKE; ELFS; Twitter.
منابع مشابه
Preprocessing Egyptian Dialect Tweets for Sentiment Mining
Research done on Arabic sentiment analysis is considered very limited almost in its early steps compared to other languages like English whether at document-level or sentence-level. In this paper, we test the effect of preprocessing (normalization, stemming, and stop words removal) on the performance of an Arabic sentiment analysis system using Arabic tweets from twitter. The sentiment (positiv...
متن کاملA Data Analytic Framework for Unstructured Text Hassanin
This paper describes a systematic flow of the unstructured data in industry, collected data, stored data, and the amount of data. Big data uses salable storage index and distributed approach to retrieve required information. Therefore, the paper introduces an unstructured data framework for managing and discovering using the 3Vs of big data: variety, velocity, and volume. Different approaches f...
متن کاملEstimating the Parameters for Linking Unstandardized References with the Matrix Comparator
This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters for the matrix comparator for obtaining the best linking results a...
متن کاملA High-Performance Model based on Ensembles for Twitter Sentiment Classification
Background and Objectives: Twitter Sentiment Classification is one of the most popular fields in information retrieval and text mining. Millions of people of the world intensity use social networks like Twitter. It supports users to publish tweets to tell what they are thinking about topics. There are numerous web sites built on the Internet presenting Twitter. The user can enter a sentiment ta...
متن کاملPolarity classification for Spanish tweets using the COST corpus
It was not until 2010 when businesses, politicians and people in general began to realise the potential of Twitter in Spain. This fact has awoken research interest in the extraction of knowledge from Twitter. This paper aims to fill the gap of the lack of resources for Twitter sentiment analysis in Spanish by performing a study of different features and machine learning algorithms for classifyi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1205.6396 شماره
صفحات -
تاریخ انتشار 2012